Data Processing
Visualization


Michael Clark
Statistician Lead


Outline

Part 1

  • Overview of Data Structures
  • Input/Output
  • Vectorization and Apply functions

Part 2

  • Pipes, and how to use them
  • plyr, dplyr, tidyr
  • data.table

Part 3

  • Visualization with ggplot2
  • Adding Interactivity

Part 1

Data Structures

Data structures

R has several core data structures:

  • Vectors
    • Factors
  • Lists
  • Matrices/arrays
  • Data frames

Vectors

Vectors form the basis of R data structures.

There are two main types, atomic vectors and lists, but I will treat lists separately.

Here is an R vector.

The elements of the vector are numeric values.

x = c(1, 3, 2, 5, 4)
x
[1] 1 3 2 5 4

Vectors

All elements of an atomic vector are the same type.

Examples include:

  • characters
  • numeric (double)
  • integer
  • logical
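A quick check with typeof, using only base R, illustrates these types:

```r
# each atomic vector has exactly one underlying type
typeof(c('a', 'b'))      # "character"
typeof(c(1.5, 2))        # "double"
typeof(c(1L, 2L))        # "integer"
typeof(c(TRUE, FALSE))   # "logical"
```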

Factors

An important type of vector is the factor.

Factors are used to represent categorical data.

x = factor(1:3, labels=c('q', 'V', 'what the heck?'))
x
[1] q              V              what the heck?
Levels: q V what the heck?

Factors

The underlying representation is numeric.

But, factors are categorical.

They can’t be used as numbers would be.

as.numeric(x)
[1] 1 2 3
sum(x)
Error in Summary.factor(structure(1:3, .Label = c("q", "V", "what the heck?"), class = "factor")): 'sum' not meaningful for factors
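A related pitfall worth noting: when factor labels happen to look numeric, as.numeric still returns the underlying codes, not the label values. A minimal base R sketch:

```r
# a factor whose labels look numeric
f = factor(c(10, 20, 30))
as.numeric(f)                # 1 2 3 -- the underlying codes, not the values
as.numeric(as.character(f))  # 10 20 30 -- convert via character first
```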

Matrices

With multiple dimensions, we are dealing with arrays.

Matrices are 2-d arrays, and are extremely common.

The vectors making up a matrix must all be of the same type.

  • e.g. all values in a matrix must be numeric.
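A quick demonstration: binding a character vector into a numeric matrix silently coerces everything to character.

```r
m = cbind(x = 1:3, y = c('a', 'b', 'c'))
m
typeof(m)   # "character" -- the integer column was coerced
```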

Creating a matrix

Creating a matrix can be done in a variety of ways.

# create vectors
x = 1:4
y = 5:8
z = 9:12

rbind(x, y, z)   # row bind
  [,1] [,2] [,3] [,4]
x    1    2    3    4
y    5    6    7    8
z    9   10   11   12
cbind(x, y, z)   # column bind
     x y  z
[1,] 1 5  9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
matrix(c(x, y, z), nrow=3, ncol=4, byrow=TRUE)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12

Lists

Lists in R are highly flexible objects.

They can contain anything as their elements, even other lists.

  • unlike vectors, whose elements must be of the same type.

Here is a list. We use the list function to create one.

x = list(1, "apple", list(3, "cat"))
x
[[1]]
[1] 1

[[2]]
[1] "apple"

[[3]]
[[3]][[1]]
[1] 3

[[3]][[2]]
[1] "cat"

Lists

We often want to loop some function over a list.

for(elem in x) class(elem)

Lists can, and often do, have named elements.

x = list("a" = 25, "b" = -1, "c" = 0)
x["b"]
$b
[1] -1

Data Frames

data.frames are a very commonly used data structure.

Unlike matrices, their columns do not all have to be of the same type.

This is because the data.frame class is actually just a list.

As such, everything about lists applies to data.frames.

But, like matrices, they can also be indexed by row or column.

Creating a data frame

mydf = data.frame(a = c(1,5,2),
                  b = c(3,8,1))

We can add row names also.

rownames(mydf) = paste0('row', 1:3)
mydf
     a b
row1 1 3
row2 5 8
row3 2 1

Input/Output

Input/Output

Standard methods of reading in data

  • read.table
  • read.csv
  • readLines

Using the foreign package:

  • read.spss
  • read.xport

Note: the foreign package no longer supports recent Stata file formats.

Newer approaches

haven: Package to read in foreign statistical files

  • read_spss
  • read_dta

readxl: for excel files

Faster approaches

readr: Faster versions of base R functions

  • read_csv
  • read_delim

These make assumptions after an initial scan of the data.

If you don’t have ‘big’ data, this won’t help much.

However, they actually can be used as a diagnostic.

  • pick up potential data entry errors.
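As a sketch of the diagnostic use (assuming the readr package is installed; column types are declared explicitly here for illustration), values that fail to parse are recorded and can be inspected with problems:

```r
library(readr)

# declare x numeric; the value that fails to parse becomes NA and is flagged
d = read_csv("x,y\n1,a\n2,b\noops,c\n", col_types = cols(x = 'd', y = 'c'))
d$x           # 1 2 NA
problems(d)   # a tibble noting the offending row and column
```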

Faster approaches

data.table: faster read.table

  • fread

Typically faster than readr approaches.

Other Data

Note that R can handle many types of data.

Some examples:

  • JSON
  • SQL
  • XML
  • YAML
  • MongoDB
  • NETCDF
  • text (e.g. a novel)
  • shapefiles
  • google spreadsheets

And many, many others.

On the horizon

feather: designed to make reading and writing data frames efficient

Works in both Python and R.

Still in early stages of development.

Indexing

Base R Indexing Refresher

Slicing vectors

letters[4:6]
[1] "d" "e" "f"
letters[c(13,10,3)]
[1] "m" "j" "c"

Slicing matrices/data.frames

myMatrix[1, 2:3]

Base R Indexing Refresher

Label-based indexing:

mydf['row1', 'b']

Position-based indexing:

mydf[1, 2]

Base R Indexing Refresher

Mixed indexing:

mydf['row1', 2]

If the row/column value is empty, all rows/columns are retained.

mydf['row1',]
mydf[,'b']

Base R Indexing Refresher

Non-contiguous:

mydf[c(1,3),]

Boolean:

mydf[mydf$a >=2,]

Base R Indexing Refresher

List/Data.frame extraction

[ : grab a slice of elements/columns

[[ : grab specific elements/columns

$ : grab specific elements/columns

List/Data.frame extraction

my_list_or_df[2:4]
my_list_or_df[['name']]
my_list_or_df$name

Vectorization

Boolean Indexing

Logicals are objects with values of TRUE or FALSE.

Assume x is a vector of numbers.

idx = x > 2
idx
x[idx]

Flexibility

We don’t have to create a Boolean object before using it.

R indexing is ridiculously flexible.

x[x > 2]
x[x != 3]
x[ifelse(x > 2, T, F)]
x[{y = idx; y}]

Vectorized operations

Consider the following loop:

for (i in 1:nrow(mydf)) {
  check = mydf$x[i] > 2
  if (check==TRUE){
    mydf$y[i] = 'Yes'
  } else {
    mydf$y[i] = 'No'
  }
}

Vectorized operations

Compare:

mydf$y = 'No'
mydf$y[mydf$x > 2] = 'Yes'

This gets us the same thing, and would be much faster.

Vectorized operations

Boolean indexing is an example of a vectorized operation.

The whole vector is considered.

  • Rather than each element individually

This is always faster.
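A rough timing sketch of the difference, using base R only:

```r
x = rnorm(1e6)

# element-by-element loop
square_loop = function(x) {
  out = numeric(length(x))
  for (i in seq_along(x)) out[i] = x[i]^2
  out
}

system.time(square_loop(x))  # noticeably slower
system.time(x^2)             # the loop happens in compiled code
```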

Vectorized operations

Log all values in a matrix.

mymatrix_log = log(mymatrix)

Way faster than looping over elements, rows or columns.

Vectorized Operations

Many vectorized functions already exist in R.

They are often written in C, Fortran etc., and so even faster.

Apply functions

A family of functions allows for a succinct way of looping.

Common ones include:

  • apply
  • lapply, sapply, vapply
  • tapply
  • mapply
  • replicate

Apply functions

  • apply
    • arrays, matrices, data.frames
  • lapply, sapply, vapply
    • lists, data.frames, vectors
  • tapply
    • grouped operations (table apply)
  • mapply
    • multivariate version of sapply
  • replicate
    • similar to sapply
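A few quick illustrations with the built-in mtcars data:

```r
# sapply: loop over the columns of a data.frame
sapply(mtcars[, 1:3], mean)

# tapply: a grouped operation (mean mpg by number of cylinders)
tapply(mtcars$mpg, mtcars$cyl, mean)

# mapply: multiple arguments varied in parallel
mapply(rep, 1:3, 3:1)
```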

Example

Standardizing variables.

for (i in 1:ncol(mydf)){
  x = mydf[,i]
  for (j in 1:length(x)){
    x[j] = (x[j] - mean(x))/sd(x)
  }
}

The above would be a really bad way to use R.

stdize <- function(x) {
  (x-mean(x))/sd(x)
}

apply(mydf, 2, stdize)

Timings

The previous demonstrates how to use apply.

However, there is a scale function in base R.
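For instance, scale does the centering and standardizing in one vectorized call:

```r
# scale() centers and divides by the standard deviation, column by column
scaled = scale(mtcars[, c('mpg', 'wt')])
round(colMeans(scaled), 10)   # 0 0
apply(scaled, 2, sd)          # 1 1
```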

Unit: milliseconds
       expr        min          lq        mean     median         uq        max neval
 doubleloop 3112.41884 3130.286411 3198.740874 3144.29025 3227.45382 3663.90853    25
 singleloop   31.59734   32.022406   33.439865   32.69933   34.70190   38.34710    25
       plyr  132.64410  133.432096  139.466588  134.74993  136.99264  242.11898    25
      apply   33.99555   34.213489   35.698046   35.84892   36.95722   37.97816    25
   parApply   21.00966   21.769137   26.662488   22.58505   24.17335   72.32103    25
 vectorized    8.01776    8.635249    9.896537   10.31631   10.45710   13.24826    25

Apply functions

Benefits

  • Cleaner/simpler code
  • Potentially more reproducible
    • more likely to use generalizable functions
  • Parallelizable

NOT faster than explicit loops.

  • single loop over columns was as fast as apply
  • replicate and mapply are especially slow

However, they can ALWAYS potentially be made faster than loops.

  • Parallelization: parApply, parLapply etc.

Personal experience

I use R every day, and rarely use explicit loops.

  • Note: no speed difference for a for loop vs. using while
  • If you must use an explicit loop, preallocate an empty object of the final size and fill it in
    • Much faster than growing the object

I never use a double loop.

Apply functions

Apply functions should be a part of your regular R experience.

Other versions we’ll talk about have been optimized.

However, you need to know the basics in order to use those.

And you may still need parallel versions.

Part 2

Pipes

Note:

More detail on much of this part is given in another workshop.

Pipes

Operators that send what comes before to what comes after.

There are many different pipes, and some packages implement their own.

However, the vast majority of packages use the same pipe:

%>%

Pipes

Here, we’ll focus on their use with the dplyr package.

Later, we’ll use it for visualizations.

Example.

mydf %>% 
  select(var1, var2) %>% 
  filter(var1 == 'Yes') %>% 
  summary

Start with a data.frame %>%

    select columns from it %>%

    filter/subset it %>%

    get a summary

Using variables as they are created

We can use variables as soon as they are created.

mydf %>% 
  mutate(newvar1 = var1 + var2,
         newvar2 = newvar1/var3) %>% 
  summarise(newvar2avg = mean(newvar2))

Pipes for Visualization (more later)

Generic example.

basegraph %>% 
  points %>%
  lines %>%
  layout

The dot

Most functions are not ‘pipe-aware’ by default.

Example: pipe to a modeling function.

mydf %>% 
  lm(y ~ x)  # error

Other pipes can handle this.

  • e.g. %$% in magrittr

But generally, one can use a dot.

  • The dot refers to the object before the pipe.
mydf %>% 
  lm(y ~ x, data=.)

Flexibility

Piping is not just for data.frames.

  • The following starts with a character vector.
  • Sends it to a recursive function (named ..).
  • .. is created on-the-fly.
  • After the function is created, it’s used on ., representing the string.
  • Result: pipes between the words.
c('Ceci', "n'est", 'pas', 'une', 'pipe!') %>%
{
  .. <-  . %>%
    if (length(.) == 1)  .
    else paste(.[1], '%>%', ..(.[-1]))
  ..(.)
} 
[1] "Ceci %>% n'est %>% pas %>% une %>% pipe!"
  • Put that in your pipe and smoke it, René Magritte!

Pipes

Pipes are best used interactively.

Extremely useful for data exploration.

Common in many visualization packages.

See the magrittr package for more pipes.

plyr, dplyr, tidyr

plyr

Original data management package of the three.

More general than dplyr.

Not as useful for most common operations, but contains:

  • more flexible versions of the apply family
  • some very useful functions not found elsewhere

plyr

adply, dlply etc.

  • First letter represents the current object (array, data.frame, list)
  • Second letter represents the returned object
library(plyr)
x = list(var1=1:5, var2=2:6)
ldply(x)
   .id V1 V2 V3 V4 V5
1 var1  1  2  3  4  5
2 var2  2  3  4  5  6
ldply(x, sum)
   .id V1
1 var1 15
2 var2 20

Option to parallelize.

plyr: some useful functions

*ply: apply style functions, with parallel capability

join_all: Recursively join a list of data frames

rbind.fill: row bind data.frames, filling in missing columns.

mapvalues/revalue: replace values

round_any: Round to multiple of any number.
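Brief sketches of a few of these (assuming plyr is installed):

```r
library(plyr)

# rbind.fill: row bind data.frames whose columns don't all match
rbind.fill(data.frame(a = 1, b = 2), data.frame(a = 3, c = 4))

# mapvalues: replace specific values
mapvalues(c('a', 'b', 'c'), from = 'b', to = 'B')

# round_any: round to a multiple of any number
round_any(137, 25)   # 125
```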

dplyr

Grammar of data manipulation.

Next iteration of plyr.

Focused on tools for working with data frames.

  • Over 100 functions

It has three main goals:

  • Make the most important data manipulation tasks easier.

  • Do them faster.

  • Use the same interface to work with data frames, data tables, or databases.

dplyr

Some key operations:

select: grab columns

  • select helpers: one_of, starts_with, num_range etc.

filter/slice: grab rows

group_by: grouped operations

mutate/transmute: create new variables

summarize: summarize/aggregate

do: arbitrary operations

dplyr

Various join/merge functions.

Little things like:

  • n, n_distinct, nth, n_groups, count, recode, between

No need to quote variable names.
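A small sketch pulling several of these together (assuming dplyr is loaded), with the built-in mtcars data:

```r
library(dplyr)

mtcars %>% 
  group_by(cyl) %>% 
  summarise(n = n(),                            # rows per group
            gears = n_distinct(gear),           # distinct gear values
            midweight = sum(between(wt, 2, 4))) # cars weighing 2-4 thousand lbs
```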

An example

Let’s say we want to select from our data the following variables:

  • Start with the ID variable
  • The variables X1:X10, which are not all together, and there are many more X columns
  • The variables var1 and var2, which are the only var variables in the data
  • Any variable that starts with XYZ

How might we go about this?

Some base R approaches

Tedious, or typically two steps just to get the columns you want.

# numeric indexes; not conducive to readability or reproducibility
newData = oldData[,c(1,2,3,4, etc.)]

# explicitly by name; fine if only a handful; not pretty
newData = oldData[,c('ID','X1', 'X2', etc.)]

# two step with grep; regex difficult to read/understand
cols = c('ID', paste0('X', 1:10), 'var1', 'var2', grep('^XYZ', colnames(oldData), value=T))
newData = oldData[,cols]

# or via subset
newData = subset(oldData, select = cols)

More

What if you also want observations where Z is Yes, Q is No, and only the observations with the top 50 values of var2, ordered by var1 (descending)?

# three operations and overwriting or creating new objects if we want clarity
newData = newData[oldData$Z == 'Yes' & oldData$Q == 'No',]
newData = newData[order(newData$var2, decreasing=T)[1:50],]
newData = newData[order(newData$var1, decreasing=T),]

And this is for fairly straightforward operations.

An alternative

newData = oldData %>% 
  filter(Z == 'Yes', Q == 'No') %>% 
  select(num_range('X', 1:10), contains('var'), starts_with('XYZ')) %>% 
  top_n(50, var2) %>% 
  arrange(desc(var1))

An alternative

dplyr and piping offer an alternative

  • you can do all this sort of stuff with base R
  • with, within, subset, transform, etc.

Even though the initial base R approach depicted is fairly concise, it can still be:

  • noisier
  • less legible
  • less amenable to additional data changes
  • requires esoteric knowledge (e.g. regular expressions)
  • often requires new objects (even if we just want to explore)

tidyr

Two primary functions for manipulating data

  • gather: wide to long
  • spread: long to wide

Other useful functions include:

  • unite: paste together multiple columns into one
  • separate: complement of unite
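A minimal sketch of unite and separate (assuming tidyr is loaded):

```r
library(tidyr)

d = data.frame(yr = 2009, mo = c(1, 2))

d2 = unite(d, date, yr, mo, sep = '-')               # one column: "2009-1", "2009-2"
separate(d2, date, into = c('yr', 'mo'), sep = '-')  # and back again
```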

Example

library(tidyr)
stocks <- data.frame( time = as.Date('2009-01-01') + 0:9,
                      X = rnorm(10, 0, 1),
                      Y = rnorm(10, 0, 2),
                      Z = rnorm(10, 0, 4) )
stocks %>% head
        time           X          Y         Z
1 2009-01-01  0.23465359 -1.9089778 -7.037391
2 2009-01-02  0.87151932  0.2355249 -2.090847
3 2009-01-03  0.03584969 -1.4706570  4.853836
4 2009-01-04 -0.74729694 -0.2460126  1.775394
5 2009-01-05 -0.45235779  1.2348015 -1.189270
6 2009-01-06 -0.49231946  1.6157330  3.669595
stocks %>% gather(stock, price, -time) %>% head
        time stock       price
1 2009-01-01     X  0.23465359
2 2009-01-02     X  0.87151932
3 2009-01-03     X  0.03584969
4 2009-01-04     X -0.74729694
5 2009-01-05     X -0.45235779
6 2009-01-06     X -0.49231946
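spread reverses this, going from long back to wide. A self-contained sketch (assuming tidyr is loaded):

```r
library(tidyr)

long = data.frame(id    = rep(1:2, each = 2),
                  stock = rep(c('X', 'Y'), 2),
                  price = c(1, 2, 3, 4))
spread(long, stock, price)
#   id X Y
# 1  1 1 2
# 2  2 3 4
```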

Personal Opinion

The dplyr grammar is clear for a lot of standard data processing tasks, and some not so common.

Extremely useful for data exploration and visualization.

  • No need to create/overwrite existing objects
  • Can overwrite columns as they are created
  • Makes it easy to look at anything, and do otherwise tedious data checks

Drawbacks:

  • not as fast as data.table for many things
  • the mindset can make for unnecessary complication
    • e.g. no need to pipe etc. to create one new variable

On the horizon

multidplyr

Partitions the data across a cluster.

Faster than data.table (after partitioning)

Data.Table

data.table

data.table works in a notably different way than dplyr.

However, you’d use it for the same reasons.

Like dplyr, the data objects are both data.frames and a package-specific class.

Faster subsetting, grouping, updating, ordered joins, and list columns.

data.table

In general, data.table works with brackets as in base R.

However, the brackets work like a function call!

  • Several key arguments
x[i, j, by, keyby, with = TRUE, ...]

Importantly:

you can’t use the brackets as you would with data.frames.

library(data.table)
df = data.table(x=sample(1:10, 6), g=1:3, y=runif(6))
df[,4]
[1] 4

data.table

x[i, j, by, keyby, with = TRUE, ...]

What i and j can be are fairly complex.

In general, you use i for filtering by rows.

df[2]
df[2,]
   x g         y
1: 6 2 0.2566232
   x g         y
1: 6 2 0.2566232

data.table

x[i, j, by, keyby, with = TRUE, ...]

In general, you use j to select (by name!) or create new columns.

  • Define a new variable with :=
df[,x]
df[,z:=x+y]  # df now has a new column
[1] 7 6 3 9 1 5
   x g         y        z
1: 7 1 0.3955308 7.395531
2: 6 2 0.2566232 6.256623
3: 3 3 0.7082341 3.708234
4: 9 1 0.7223548 9.722355
5: 1 2 0.5466711 1.546671
6: 5 3 0.8481999 5.848200

data.table

Dropping columns is awkward.

  • because j is an argument
df[,-y]             # creates negative values of y
df[,-'y', with=F]   # drops y, but now needs quotes
df[,y:=NULL]        # drops y, but this is just a base R approach
df$y = NULL
[1] -0.3955308 -0.2566232 -0.7082341 -0.7223548 -0.5466711 -0.8481999
   x g        z
1: 7 1 7.395531
2: 6 2 6.256623
3: 3 3 3.708234
4: 9 1 9.722355
5: 1 2 1.546671
6: 5 3 5.848200
   x g        z
1: 7 1 7.395531
2: 6 2 6.256623
3: 3 3 3.708234
4: 9 1 9.722355
5: 1 2 1.546671
6: 5 3 5.848200

Grouped operations

Group-by operations, with creation of a new variable.

Note that these actually modify df in place.

df1 = df2 = df
df[,sum(x,y), by=g]                  # sum of all x and y values
   g V1
1: 1 42
2: 2 33
3: 3 34
df1[,newvar := sum(x), by=g]         # add new variable to the original data 
   x g        z newvar
1: 7 1 7.395531     16
2: 6 2 6.256623      7
3: 3 3 3.708234      8
4: 9 1 9.722355     16
5: 1 2 1.546671      7
6: 5 3 5.848200      8
df1
   x g        z newvar
1: 7 1 7.395531     16
2: 6 2 6.256623      7
3: 3 3 3.708234      8
4: 9 1 9.722355     16
5: 1 2 1.546671      7
6: 5 3 5.848200      8

Grouped operations

We can also create groupings on the fly.

For a new summary data set, we’ll take the following approach.

df2[, list(meanx = mean(x), sumx = sum(x)), by=g==1]
       g meanx sumx
1:  TRUE  8.00   16
2: FALSE  3.75   15

Faster!

  • joins: fast and easy to do (note that i can be a data.table)
df1[df2]
  • group operations: via setkey
  • reading files: fread
  • character matches: e.g. via chmatch
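A minimal keyed-join sketch (assuming data.table is loaded):

```r
library(data.table)

dt1 = data.table(id = c('a', 'b', 'c'), x = 1:3)
dt2 = data.table(id = c('a', 'c'), y = c(10, 30))

setkey(dt1, id)
setkey(dt2, id)

dt1[dt2]   # join on the key: rows of dt1 matching dt2's ids, with y appended
```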

Timings

The following demonstrates some timings from here.

  • Reproduced on my own machine
  • based on 50 million observations
  • Grouped operations are just a sum and length on a vector.

By the way, never, ever use aggregate. For anything.

          fun elapsed
1:  aggregate  114.35
2:         by   24.51
3:     sapply   11.62
4:     tapply   11.33
5:      dplyr   10.97
6:     lapply   10.65
7: data.table    2.71

Ever.

Really.

Pipe with data.table

Can be done but awkward at best.

mydf[,newvar:=mean(x),][,newvar2:=sum(newvar), by=group][,-'y', with=FALSE]
mydf[,newvar:=mean(x), 
  ][,newvar2:=sum(newvar), by=group
  ][,-'y', with=FALSE
  ]

Probably better to just use a pipe and dot approach

mydf[,newvar:=mean(x),] %>% 
  .[,newvar2:=sum(newvar), by=group] %>% 
  .[,-'y', with=FALSE]

My take

Faster methods are great to have.

  • Especially for group-by and joins.

Drawbacks:

  • Complex
  • The syntax can be awkward
  • It doesn’t work like a data.frame
  • Piping with brackets

Compromise

If speed and/or memory is (potentially) a concern, data.table

For interactive exploration, dplyr

Piping allows one to use both, so no need to choose.

And on the horizon…

dtplyr

Coming soon to an R near you.

This implements the data table back-end for ‘dplyr’ so that you can seamlessly use data table and ‘dplyr’ together.

Or play with now.

  package    timing
  dplyr       10.97
  data.table   2.71
  dtplyr       2.70

Part 3

ggplot2

ggplot2

ggplot2 is an extremely popular package for visualization in R.

  • and copied in other languages/programs

It entails a grammar of graphics.

  • Every graph is built from the same few parts

Key ideas:

  • Aesthetics
  • Layers (and geoms)
  • Piping
  • Facets
  • Themes
  • Extensions

ggplot2

Strengths:

  • Ease of getting a good looking plot
  • Easy customization
  • A lot of data processing is done for you
  • Clear syntax
  • Easy multidimensional approach
  • Equally spaced colors as a default

Aesthetics

Aesthetics allow one to map data to aesthetic aspects of the plot.

  • Size
  • Color
  • etc.

The function used in ggplot to do this is aes

aes(x=myvar, y=myvar2, color=myvar3, group=g)

Layers

In general, we start with a base layer and add to it.

In most cases you’ll start as follows.

ggplot(aes(x=myvar, y=myvar2), data=mydata)

This would just produce a plot background.

Piping

Layers are added via piping.

The first layers added are typically geoms:

  • points
  • lines
  • density
  • text

ggplot2 was using pipes before it was cool, and so it has a different pipe.

Otherwise, the concept is the same as before.

ggplot(aes(x=myvar, y=myvar2), data=mydata) +
  geom_point()

And now we would have a scatterplot.

Examples

library(ggplot2)
data("diamonds"); data('economics')
ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point()

Examples

ggplot(aes(x=date, y=unemploy), data=economics) +
  geom_line()

Examples

In the following, one setting (alpha) is fixed rather than mapped to the data.

ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point(aes(size=carat, color=clarity), alpha=.25) 

Stats

There are many statistical functions built in.

Key strength: you don’t have to do much preprocessing.

Quantile regression lines:

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  geom_quantile()

Stats

Loess (or additive model) smooth:

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  geom_smooth()

Stats

Bootstrapped confidence intervals:

ggplot(mtcars, aes(cyl, mpg)) + 
  geom_point() +
  stat_summary(fun.data = "mean_cl_boot", colour = "orange", alpha=.75, size = 1)

Facets

Facets allow for paneled display, a very common operation.

In general, we often want comparison plots.

facet_grid will produce a grid.

  • Often this is all that’s needed

facet_wrap is more flexible.

Both use a formula approach to specify the grouping.

facet_grid

ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  facet_grid(vs ~ cyl, labeller = label_both)

facet_wrap

ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  facet_wrap(vs ~ cyl, labeller = label_both, ncol=2)

Fine control

ggplot2 makes it easy to get good looking graphs quickly.

However, the amount of fine control is extensive.

ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point(aes(color=clarity), alpha=.5) + 
  scale_y_log10(breaks=c(1000,5000,10000)) +
  xlim(0, 10) +
  scale_color_brewer(type='div') +
  facet_wrap(~cut, ncol=3) +
  theme_minimal() +
  theme(axis.ticks.x=element_line(color='darkred'),
        axis.text.x=element_text(angle=-45),
        axis.text.y=element_text(size=20),
        strip.text=element_text(color='forestgreen'),
        strip.background=element_blank(),
        panel.grid.minor=element_line(color='lightblue'),
        legend.key=element_rect(linetype=4),
        legend.position='bottom')



Themes

In the last example you saw two uses of a theme.

  • built-in
  • specific customization

Each argument takes on a specific value or an element function:

  • element_rect
  • element_line
  • element_text
  • element_blank

Themes

The base theme is not too good.

  • not for web
  • doesn’t look good for print either

You will almost invariably need to tweak it.

Extensions

ggplot2 now has its own extension system.

There is even a website to track the extensions.

Examples include:

  • additional themes
  • interactivity
  • animations
  • marginal plots
  • network graphs

Summary ggplot2

ggplot2 is an easy to use, but powerful visualization tool.

Allows one to think in many dimensions for any graph:

  • x
  • y
  • color
  • size
  • opacity
  • facet

2d graphs are only useful for conveying the simplest of ideas.

Use ggplot2 to easily create more interesting visualizations.

Packages

ggplot2 is the most widely used package for visualization in R.

However, it is not interactive by default.

Many packages use htmlwidgets, d3 (JavaScript library) etc. to provide interactive graphics.

Packages

General:

  • plotly
    • also used in Python, Matlab, and Julia; can convert ggplot2 figures to interactive ones
  • ggvis
    • interactive successor to ggplot2, though not currently under active development
  • rbokeh
    • like plotly, it also has cross program support

Specific functionality:

  • DT
    • interactive data tables
  • leaflet
    • maps with OpenStreetMap
  • dygraphs
    • time series visualization
  • visNetwork
    • Network visualization

Piping for Visualization

One of the advantages to piping is that it’s not limited to dplyr style data management functions.

Any R function can be potentially piped to.

  • several examples have already been shown.

This facilitates data exploration, especially visually.

  • don’t have to create objects
  • new variables are easily created and subsequently manipulated just for vis
  • data manipulation not separated from visualization

htmlwidgets

Many newer visualization packages take advantage of piping.

htmlwidgets is a package that makes it easy to create javascript visualizations.

  • i.e. what you see everywhere on the web.

The packages using it typically are pipe-oriented and produce interactive plots.

plotly example

A couple demonstrations with plotly.

Note the layering as with ggplot2.

Piping used before plotting.

library(plotly)
midwest %>% 
  filter(inmetro==T) %>% 
  plot_ly(x=percollege, y=percbelowpoverty, mode='markers') 

plotly example

plotly has modes, which allow for points, lines, text and combinations.

Traces work similar to geoms.

library(mgcv)

mtcars %>% 
  mutate(amFactor = factor(am, labels=c('auto', 'manual')),
         hovertext = paste(wt, mpg, amFactor),
         prediction = predict(gam(mpg~s(wt), data=mtcars))) %>% 
  arrange(wt) %>% 
  plot_ly(x=wt, y=mpg, color=amFactor, width=800, height=500, mode='markers') %>% 
  add_trace(x=wt, y=prediction, alpha=.5, hover=hovertext, name='gam prediction')

plotly example

ggplotly

The nice thing about plotly is that we can feed a ggplot to it.

It would have been easier to use geom_smooth, so let’s do so.

gp = mtcars %>% 
  mutate(amFactor = factor(am, labels=c('auto', 'manual')),
         hovertext = paste(wt, mpg, amFactor),
         prediction = predict(gam(mpg~s(wt), data=mtcars))) %>% 
  arrange(wt) %>% 
  ggplot(aes(x=wt, y=mpg)) +
  geom_smooth() +
  geom_point(aes(color=amFactor))
ggplotly(gp, width='auto')

dygraphs

dygraphs is useful for time-series.

  • Uses the dygraphs.js library
library(dygraphs)
data(UKLungDeaths)
cbind(ldeaths, mdeaths, fdeaths) %>% 
  dygraph(width=800) %>% 
  dyOptions(stackedGraph = TRUE, colors=RColorBrewer::brewer.pal(3, name='Dark2')) %>%
  dyRangeSelector(height = 20)

visNetwork

visNetwork allows for network visualizations

  • Uses the vis.js library
library(visNetwork)
visNetwork(nodes, edges, height=600, width=800) %>% 
  visNodes(shape='circle', 
           font=list(), 
           scaling=list(min=10, max=50, label=list(enable=T))) %>% 
  visLegend()

data table

Use the DT package for interactive dataframes.

library(DT)
movies %>% 
  select(1:6) %>% 
  filter(rating>9) %>% 
  slice(sample(1:nrow(.), 50)) %>% 
  datatable(rownames=F)

Shiny

Shiny is a framework that can essentially allow you to build an interactive website.

  • Provided by RStudio developers

Most of the more recently developed visualization packages will work specifically within the shiny and rmarkdown settings.

Interactive and Visual Data Exploration

Interactivity allows for even more dimensions to be brought to a graphic.

Interactive graphics are more fun too!

  • But they must serve a purpose
  • Too often they are simply a distraction, and detract from the data story

Just a couple visualization packages can go a very long way.

Summary

With the right tools, data exploration can be:

  • easier
  • faster
  • more efficient

Use them to wring your data dry of what it has to offer.



Embrace a richer understanding of your data!